ABSTRACT 

Distribution maps are generally based on documented records rather than true occurrence patterns. This may be problematic 
for cryptic, under-reported species that occur in areas poorly covered by observers. Species distribution models may help 
overcome this challenge. Here, all available records of the migratory Anthus trivialis (tree pipit) and resident Anthus nyassae 
(wood pipit) for southern Africa and adjacent areas were assembled to train generalised linear models, random forest and 
gradient boosting machine species distribution models. Sampling pseudo-absences from a common species’ similarly biased 
records helped to account for the spatial sampling bias present in the data. The model outputs suggest that A. trivialis and 
A. nyassae display a latitudinal habitat suitability gradient in the area of interest, opposing a latitudinal reporting gradient. The 
migratory behaviour of A. trivialis may blur its ecological niche. Additional and more reliable field observations are needed to 
confirm these findings. This study provides a clear framework to assist distribution delimitations from citizen science data by 
counteracting observer and sampling biases. 


Keywords: Anthus nyassae, Anthus trivialis, citizen science, distribution, ecological niche model, species distribution model 


INTRODUCTION


Delimiting species distributions can be a challenging
endeavour, as it attempts to discretize different levels
of abundance from data that are often incomplete.
The recent emergence of citizen-science platforms
such as SABAP2 (Second Southern African Bird
Atlas Project) (Brooks and Ryan 2023) has generated
a wealth of data that can help improve the
delimitation of distributions. However, sampling and
observer biases in the data (Kosmala et al. 2016) may
distort the results of these efforts. As such,
distribution maps based on citizen science, while
spatially detailed compared to broad, expert-drawn
range maps, may condense observations rather than
delimiting the true occurrence patterns of a species.


SDMs (Species Distribution Models, introduced by
Guisan et al. 2017, Guisan and Zimmermann 2000)
may help to overcome this challenge by generating
habitat models through the correlation of a taxon’s
current presence or absence with prevailing
environmental conditions. Thus, the models can be
useful in detecting areas in which species may be
under-recorded.


Anthus trivialis (tree pipit) is a non-breeding
Palearctic migrant in sub-Saharan Africa, mainly
present from October to March. It shares its
woodland habitat with the resident Anthus nyassae
(wood pipit). Few studies have examined the status
and distribution of A. trivialis and A. nyassae in
sub-Saharan Africa (Clancey 1987, 1989, 1990, Adams
et al. 2022). The available distribution maps of the two
species vary significantly between sources due to
poor observer coverage in certain areas (Clancey
1987, 1989, 1990, BirdLife International 2016,
2018). Furthermore, both species may be challenging
to identify due to their cryptic appearance. They thus
provide good examples and materials for testing
whether available records spatially reflect their
ecological niche, as modelled by SDMs.


METHODS


The study area covered Angola, Zambia, Malawi,
Mozambique and countries to the south within which
the area of interest was further delimited by both data
availability and the centroids of the distribution of
A. nyassae and the wintering distribution of
A. trivialis, respectively. All available records of the
two species in the area of interest were gathered by
consulting eBird (Auer et al. 2022), GBIF (Global
Biodiversity Information Facility) (multiple sources
outlined below), accessed through rgbif
(Chamberlain et al. 2023), iNaturalist (iNaturalist
contributors 2023), ABAP (African Bird Atlas
Project), accessed through SABAP2 (Brooks and
Ryan 2023) and termed SABAP2 thereafter, SARBN
(Southern African Rare Bird News) (SARBN 2023),
SAFRING (South African bird ringing unit)
(SAFRING 2023) and BirdPix (Navarro 2023).
Except for the SAFRING data and some GBIF 
entries, which are part of museum collections or other 
scientific occurrence datasets (multiple sources 
outlined below), all records are citizen science based. 
The datasets were cleaned of duplicates and merged 
(Table 1). Although partially overlapping, eBird and 
GBIF provided the most data, followed by SABAP2. 
The majority of data were recorded in recent years. 
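
The merging and de-duplication step can be illustrated with a minimal sketch. The matching rule used here (same species, same date, coordinates equal after rounding to four decimals) is an assumption made for illustration, as the text does not state how duplicates were identified. The `Record` fields and the example records are hypothetical.

```python
from typing import Iterable, NamedTuple


class Record(NamedTuple):
    species: str
    lat: float
    lon: float
    date: str    # ISO date string, or "" when the record is undated
    source: str  # e.g. "eBird" or "GBIF"


def merge_and_deduplicate(datasets: Iterable[Iterable[Record]],
                          decimals: int = 4) -> list:
    """Merge several sources, keeping the first record seen for each key.

    Assumed duplicate key: species, rounded latitude/longitude and date.
    """
    seen = set()
    merged = []
    for dataset in datasets:
        for rec in dataset:
            key = (rec.species, round(rec.lat, decimals),
                   round(rec.lon, decimals), rec.date)
            if key not in seen:
                seen.add(key)
                merged.append(rec)
    return merged


# Made-up records, for illustration only (not taken from the study data).
ebird = [Record("Anthus nyassae", -15.41670, 28.28330, "2019-11-02", "eBird")]
gbif = [
    # Same sighting shared with a second platform.
    Record("Anthus nyassae", -15.41671, 28.28329, "2019-11-02", "GBIF"),
    Record("Anthus trivialis", -13.00000, 31.50000, "2020-01-15", "GBIF"),
]
cleaned = merge_and_deduplicate([ebird, gbif])
```

Listing the sources in a fixed order makes the result reproducible, since the first source to contribute a record is the one that is kept.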


Based on the ecology of A. nyassae and A. trivialis 
(Chittenden et al. 2018), seven environmental 
predictors were selected (Table 2) that appeared 
meaningful in determining the ecological niche of the 
two species. Although A. trivialis is only present in 
the region during the local summer months, the 
selected predictors quantifying precipitation and 
temperature span the whole year for both species, as 
the prevailing environmental conditions during the 
local summer months are dictated by the climatic 
conditions across all seasons. For example, winter 
temperatures may affect food availability during the 
summer, and the vegetation in winter rainfall areas 
does not rely on precipitation during the presence of 
A. trivialis. Furthermore, rare overwintering birds 
have been recorded (Chittenden et al. 2018). 


All predictor data were acquired and converted into
rasters with a spatial resolution of 1 km using Google
Earth Engine (https://earthengine.google.com). The
predictors and records were loaded into the
R statistical environment (R Core Team 2018). A
pairwise correlation analysis yielded Pearson’s
correlation coefficients below 0.6 for all predictor
combinations, indicating limited collinearity between
predictors. Duplicate data points at 1 km resolution
were removed.
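
The two screening steps in this paragraph can be sketched as follows. This is an illustrative sketch and not the study's R script: the function names, the toy predictor values and the cell size of 1/120 degree (roughly 1 km at the equator) are assumptions.

```python
from itertools import combinations
from math import floor, sqrt


def pearson(x, y):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlated_pairs(predictors, threshold=0.6):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    flagged = []
    for (name_a, xa), (name_b, xb) in combinations(predictors.items(), 2):
        r = pearson(xa, xb)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged


def thin_to_cells(points, cell_deg=1 / 120):
    """Keep one (lat, lon) point per grid cell (assumed cell size in degrees)."""
    kept = {}
    for lat, lon in points:
        cell = (floor(lat / cell_deg), floor(lon / cell_deg))
        kept.setdefault(cell, (lat, lon))  # first point in a cell wins
    return list(kept.values())


# Toy predictor values sampled at six locations (made up for illustration).
toy = {
    "predictor_a": [1, 2, 3, 4, 5, 6],
    "predictor_b": [2, 1, 4, 3, 6, 5],
    "predictor_c": [3, 1, 2, 2, 1, 3],
}
flagged = correlated_pairs(toy)

# The first two points share a cell; the third lies in a different cell.
thinned = thin_to_cells([(-15.0001, 28.0001), (-15.0002, 28.0002), (-15.02, 28.0)])
```

An empty `flagged` list corresponds to the outcome reported here, where no predictor pair reached the 0.6 threshold.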


Sampled SABAP2 occurrences of the fork-tailed 
drongo (Dicrurus adsimilis) (Brooks and Ryan 
2023), a common bird present throughout most of 
southern Africa, served as pseudo-absences to 
counter the spatial sampling bias in the available 
records of A. nyassae and A. trivialis (Kramer-Schadt 
et al. 2013). The apparent habitat requirement of 
D. adsimilis is the presence of wooded cover, the 
same prerequisite for the occurrence of A. nyassae 
and A. trivialis (Chittenden et al. 2018). Thus, this 
step assumes that the absence of an observation of 
D. adsimilis at a given location implies a high 
probability that the location has not been sufficiently 
covered by observers to detect the presence of 
A. nyassae and A. trivialis. 
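
The idea of drawing pseudo-absences from a similarly biased reference species can be sketched as below. The function name, seed and synthetic coordinates are hypothetical. Only the sample sizes of 2,500 and 500 come from the text.

```python
import random


def sample_pseudo_absences(reference_points, n, seed=42):
    """Draw n pseudo-absences without replacement from reference-species records.

    Because the draw follows the density of the reference records, the
    pseudo-absences inherit the same spatial observer effort as the presences.
    """
    if n > len(reference_points):
        raise ValueError("not enough reference records to sample from")
    rng = random.Random(seed)
    return rng.sample(reference_points, n)


# Synthetic stand-in for D. adsimilis records (made-up coordinates).
_rng = random.Random(0)
drongo_points = [(_rng.uniform(-35.0, -5.0), _rng.uniform(12.0, 41.0))
                 for _ in range(3000)]

glm_background = sample_pseudo_absences(drongo_points, 2500)  # for the GLMs
tree_background = sample_pseudo_absences(drongo_points, 500)  # for RF and GBM
```

Fixing the seed keeps the pseudo-absence set identical between runs, so differences between models are not caused by a different background sample.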


Three distinct SDMs were run for each species in R 
(R Core Team 2018), namely GLM (Generalised 
Linear Model), RF (Random Forest), and GBM 
(Gradient Boosting Machine) models. For the GLMs, 
a set of 2,500 pseudo-absences was used, while for 
the RF and GBM models, this number was reduced to 500 to 
approximately match the number of presences, as 
recommended for tree-based algorithms (Guisan et 
al. 2017). The GLMs were fitted using linear and 
quadratic terms as well as a stepwise variable 
selection based on the AIC (Akaike Information 
Criterion). A minimum of 10 observations was kept 
for every node, and 500 trees were grown for the RF 
models using the ranger package (Wright et al. 2023). 
The GBMs were fitted with a minimum of 10 
observations per node, 500 trees, a learning rate of 
0.1, and 10 cross-validation folds using the gbm 
package (Greenwell et al. 2022). All models were 
trained on all available occurrences except for the
SAFRING (SAFRING 2023) and BirdPix (Navarro
2023) datapoints. The latter were used for a visual
comparison with the model predictions (Figure 3), as
ringing data and reports covered by photographs are
more reliable than ordinary citizen science records.
The low number of data points and the sampling bias
present in the SAFRING and BirdPix data prevented
a computationally independent validation approach.
Instead, a five-fold cross-validation was used to
evaluate the models based on the ROC (Receiver
Operating Characteristic) curve and AUC (Area
Under Curve) values (Figure 2). The generated
prediction layers were visualised using QGIS (QGIS
Development Team 2023). For A. nyassae, the RF
model was used to produce a projection, while for
A. trivialis, the GBM was chosen based on model
performance (high AUC values) and conservatism in
predicting suitability (the model predicts high
suitability less often).


Table 1: Number of reported occurrences of Anthus nyassae and A. trivialis and time spans covered by cleaned datasets used 
in species distribution models. The indicated time span for GBIF is based on a minority of dated records. N/A: not applicable. 

Source              Anthus nyassae             Anthus trivialis
                    Records    Years           Records    Years
eBird               315        1971-2023       137        1971-2023
GBIF                268        (2015-2023)     206        (2015-2023)
iNaturalist         22         2011-2023       16         2014-2023
SABAP2              127        2008-2023       60         2010-2023
SARBN               N/A        N/A             6          2013-2019
SAFRING; BirdPix    33         2006-2021       14         1960-2022
Total               717        1971-2023       418        1960-2023


Table 2: Environmental predictors included in the species distribution models for Anthus nyassae and A. trivialis. A 
pairwise correlation test yielded Pearson’s correlation coefficients below 0.6 for all combinations. 

Predictor                  Ecological scale    Source
Annual mean temperature    Climatic            Karger et al. (2017)
Annual precipitation       Climatic            Karger et al. (2017)
Elevation                  Topographic         Amatulli et al. (2021)
Tree density               Ecological          Crowther et al. (2015)
Leaf area index            Ecological          Myneni et al. (2021)
Human development          Anthropogenic       Tuanmu & Jetz (2014)
Landscape intactness       Anthropogenic       Potapov et al. (2008)


RESULTS


All models showed strong predictive performance.
The Area Under Curve (AUC) values of the
Receiver Operating Characteristic (ROC) curves
were all above 0.9, indicating good predictions
(Swets 1988) (Figure 2). The climatic predictors
performed best, and the anthropogenic variables
(Table 2) performed worst in explaining habitat
suitability for both species (Figure 1). Predictor
importance values were generally lower for
A. trivialis than for A. nyassae, coinciding with
slightly lower AUC values for A. trivialis (Figure 1,
Figure 2).


Figure 1: Predictor importance across three species distribution models of two species of pipits (Anthus nyassae and 
A. trivialis). Generally, the climatic predictors (annual mean temperature and annual precipitation) performed best in 
explaining habitat suitability for both species, followed by topography and leaf area index. The tree-based algorithms 
(RF: Random Forest model; GBM: Gradient Boosting Machine model) yielded more balanced predictor importance values 
than the GLM (Generalised Linear Model). For the latter, only the coefficients of the linear terms are illustrated, as all 
regression coefficients of the quadratic terms were < 0.01. Furthermore, the stepwise variable selection based on Akaike 
Information Criterion excluded tree density and human development from the models. 


Figure 2: Receiver Operating Characteristic (ROC) curves and Area Under Curve (AUC) values for three species distribution 
models of two species of pipits (Anthus nyassae and A. trivialis), generated from five-fold cross-validation. All models 
performed well for both species (Swets 1988). GLM: Generalised Linear Model; RF: Random Forest model; GBM: Gradient 
Boosting Machine model. 


Figure 3: The white circles locate all available records for A. nyassae (left) and A. trivialis (right) in southern Africa and adjacent 
countries. The red polygons correspond to the (wintering for A. trivialis) distribution (BirdLife International 2016, 2018). The 
shades of green (A. nyassae) and blue (A. trivialis) show the potential distribution based on habitat suitability, as suggested by 
the random forest (A. nyassae) and gradient boosting machine (A. trivialis) models. 


The resulting habitat suitability maps (Figure 3) 
suggest that A. nyassae and A. trivialis display a 
latitudinal occurrence probability gradient: both 
species appear to be rare in the southern parts of the 
study area and more common further north. This 
contrasts with a latitudinal reporting gradient, as 
relatively few occurrences have been reported from 
potentially suitable areas such as Angola or northern 
Zambia. 
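
The five-fold cross-validation behind the AUC values described in the Methods can be sketched as follows. Several parts are assumptions made for illustration: the folds are stratified by class, the AUC is computed as a pairwise rank statistic, and a simple centroid-distance scorer stands in for the GLM, RF and GBM models, which are not reimplemented here. The toy data are synthetic.

```python
import random


def auc(scores, labels):
    """AUC: chance that a random presence outscores a random pseudo-absence."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)  # ties count half
    return wins / (len(pos) * len(neg))


def stratified_folds(labels, k=5, seed=1):
    """Split row indices into k folds with similar class proportions (assumed)."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        idx = [i for i, l in enumerate(labels) if l == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds


def cross_validated_auc(rows, labels, fit, k=5, seed=1):
    """fit(train_rows, train_labels) must return a function scoring one row."""
    results = []
    for fold in stratified_folds(labels, k, seed):
        held_out = set(fold)
        train = [i for i in range(len(labels)) if i not in held_out]
        predict = fit([rows[i] for i in train], [labels[i] for i in train])
        results.append(auc([predict(rows[i]) for i in fold],
                           [labels[i] for i in fold]))
    return results


def fit_centroid(rows, labels):
    """Stand-in model (not a GLM, RF or GBM): minus distance to presence centroid."""
    presences = [r for r, l in zip(rows, labels) if l == 1]
    centre = [sum(col) / len(presences) for col in zip(*presences)]

    def predict(row):
        return -sum((a - b) ** 2 for a, b in zip(row, centre)) ** 0.5

    return predict


# Synthetic two-predictor data: presences near (1, 1), pseudo-absences near (0, 0).
_rng = random.Random(0)
rows = ([(_rng.gauss(1, 0.3), _rng.gauss(1, 0.3)) for _ in range(40)]
        + [(_rng.gauss(0, 0.3), _rng.gauss(0, 0.3)) for _ in range(40)])
labels = [1] * 40 + [0] * 40
fold_aucs = cross_validated_auc(rows, labels, fit_centroid)
```

Each held-out fold is scored by a model that never saw it during fitting, so the five values estimate out-of-sample discrimination.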


DISCUSSION 


The output suggests that current distribution maps 
exclude areas suitable for the potential occurrence of 
A. trivialis and A. nyassae, perhaps because they have 
been poorly covered by observers. Further 
exploration of the areas in question may yield new 
records of both species. 


The strong model performances (Swets 1988) 
indicate a clear delimitation of the ecological niches 
of both species: the birds can generally be found in 
broadleaved woodlands at 800 m or more above 
sea level, with at least 500 mm of annual rainfall. 
Unlike the resident A. nyassae, A. trivialis is only 
present in sub-Saharan Africa from October to 
March. The migratory behaviour may be reflected in 
the marginally poorer performance of the models in 
predicting its presence or absence (Figure 2) and the 
generally slightly lower predictor importances 
(Figure 1). Temporal fluctuations in the migratory 
patterns of A. trivialis due to varying food availability 
or weather conditions between years may further blur 
the picture. 


Several sources of bias in the occurrence data need to 
be considered to contextualise the model outputs. 
Observer biases are inevitable in the context of both 
citizen science projects, such as eBird (Auer et al. 
2022), iNaturalist (iNaturalist contributors 2023), or 
SABAP2 (Brooks & Ryan 2023), and platforms that 
are partially fed by citizen science, including GBIF 
(multiple sources outlined below). While the amount 
of data produced by many citizen scientists may 
make up for the quality trade-off (Kosmala et al. 
2016), data quality control mechanisms may not 
always be able to flag faulty records. Anthus pipits 
can be notoriously difficult to identify even for 
specialists; hence, citizen science records should be 
approached with cautious scepticism. Although 
scarce, data provided by ringers or backed with 
photographs are far more reliable and may help to 
validate both other records and model outputs 
(Figure 3). 


Furthermore, occurrence data may often be subject to 
a strong sampling bias as more accessible areas 
attract more observers (Kosmala et al. 2016). As 
opposed to SABAP1 (1987-1991), SABAP2 (Brooks 
& Ryan 2023) does not entail spatially systematic 
observations throughout the region (Bonnevie 2011). 
Instead, the observer decides where to observe. As 
such, the discrepancy between the model outputs and 
the number of reported sightings of A. trivialis around 
Gauteng, South Africa, is somewhat expected. On the 
other hand, the lack of records from central Angola 
for both species may be due to a lower observer 
density rather than to the species’ absence. Choosing to 
sample pseudo-absences from a common species that 
shows the same sampling bias as the target species 
appears to be an efficient strategy to counteract this 
constraint in the modelling process (Kramer-Schadt 
et al. 2013). 


The conclusions suggested by the model outputs need 
to be considered with caution. SDMs attempt to 
model a taxon’s fundamental niche from occurrences 
that reflect the realised niche (Guisan et al. 2017). 
This can be problematic when niche parameters that 
are not captured by the model, such as biotic 
interactions, dictate where a species can occur. While the 
output of this study may guide efforts to locate 
species where they have not been previously 
recorded, field records are needed to generate the 
final evidence. 


Applying this approach to other cryptic and 
potentially under-reported species occurring in areas 
still poorly covered by observers may help to 
delimit distributions through citizen science. 
However, due to the correlative nature of SDMs, their 
output provides merely an indication of where a 
species may occur, while conclusive evidence 
remains dependent on field data. 


ADDITIONAL RESOURCES 

The following GitHub repository provides this study’s 
Google Earth Engine script that was used to acquire and 
preprocess the predictors, as well as the R script that was 
used to preprocess the data and run the models: 
https://github.com/Manuel-Weber-ETH/Anthus_nyassae_trivialis.git 


